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1. Introduction 



Finite mixture models have long been used as a way to model a sample of ob- 
servations that arise from a number of (usually) a priori known classes with 

1 
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unknown proportions. For example, consider inferring the strategies young chil- 
dren employ when presented with a cognitive task. The children's performance 
on the task may be modeled using finite mixtures with the components pertain- 
ing to the different strategies (Thomas and Horton (1997)). Or perhaps, consider 
attempting to classify a group of individuals by the way they each speak the 
same word - a challenging problem due to context dependencies (i.e. different 
points in time, gender, etc.) in speech recognition. Here, a finite mixture may 
be used with the components pertaining to different 'vowel classes' of spoken 
words (Peng ct al. (199G)). Or finally, consider assessing the service quality of 
banking institutions. The institutions may be modeled using finite mixtures 
with components pertaining to different market segments (Wedel and DeSarbo 



Studies such as those mentioned above have the roots of their analyses 
grounded in a seminal work by Pearson (1894). In this article, Pearson (1894) 
was one of the first individuals to incorporate the use of mixture models as 
well as note some of the issues surrounding them - in particular estimation and 
identifiability. These are issues still prominent in today's mixture research and 
they will be addressed in this work. 

Since the time of the Pearson (1894) article, a great deal of literature has 
emerged in many disciplines regarding mixture models. In addition to the numer- 
ous technical and cross-disciplinary articles on mixture modeling, monographs 
concerning the subject include Everitt and Hand (1981), Tittcrington ct al. (1985), 
Lindsay (1995), and McLaclilan and Peel (2000). This article addresses many of 
the issues presented in these monographs, as well as current work, but at a 
high-level overview. 

2. A General Mixture Model 

Suppose we have n subjects where we take a series of m measurements, say 
Y,; = (li_i, . . . ,Yi^jn)'^ , on the i*'' subject for i = 1,. . . ,n. Furthermore, take 
y^, . . . , y„ as realized values of the Y^'s, which are independent and identically 
distributed (iid) according to a distribution F. In this scenario, standard multi- 
variate techniques (Johnson and Wichcrn (2002) and Anderson (2003)) can be 
employed to estimate the common population mean vector, fi, and the popula- 
tion variance-covariance matrix, E. 

Suppose, in addition to the above scenario, there is an assumed heterogeneity 
with respect to the response tendencies of the subjects. One way to account for 
this heterogeneity is by suggesting k different classes to which the subjects 
could essentially belong. Assuming fixed k, the distribution of the Y^'s has the 
k- component mixture density 



(1994)). 



k 




(2.1) 



where Xj > and J2j=i — ^ weights (or mixing proportions) for the 

components of the model. (The subscript for / will be suppressed except when 
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to stress the dependence of / on fc.) Furthermore, define 

Afc-i = |(Ai,A2,...,Afe_i) :^^A, < 1, A, € (0, 1) Vjj c R^^^i, 

where A^ has been arbitrarily omitted since Afc = 1 — j=i ■ The gj 's are 
known component densities, parameterized by 6j g Qj C R'Jj such that 0^ 
represents the specified parameter space for the Oj's. The mixture density / is 
parameterized hy £ ^ such that * represents the specified parameter space 
for all unknown parameters in the mixture model. Note that 




where ^ C M'' and r = Qj) + k — ^- We will take F as the correspond- 

ing fc-component mixture distribution whose components are composed of the 
distributions Gj. For the scenarios presented in this work, the Gj differ only 
in 9j, thus wc will take gj = g and qj = q which yields \I/ = O*"' x Afe_i and 
r = kq + k — I. 



3. Estimation 

In this section, we focus on estimation of the parameters of a mixture model, 
ijj, given Y and k. Early works employed a method of moments approach (for 
instance, Pearson (1894) and Quandt and Ramsey (1978)), but with the advent 
of more efficient computing, numerous algorithms have emerged as tools for 
estimation in the mixture setting. These techniques can be classified into two 
primary categories: likelihood methods and Bayesian methods. We will provide 
a brief literature review on some of the available techniques and provide a more 
complete description of a few commonly employed algorithms. We will close this 
section with one final issue which concerns estimating the number of components 
when this is not known a priori. 



3.1. Likelihood Methods 

The likelihood for the parameters of a mixture model, ip, can be easily formu- 
lated using the mixture density in (2.1) as 

71 

i(V';y) = n/(y^;^)- 

1=1 

In dealing with likelihood methods, it is often easier to work with the log like- 
lihood: 

n 

£(tA) = logi(tA; y) = ^ log /(y,; xj,). (3.1) 

i=l 
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Then, an estimate tp (the MLE) is provided by solving 

where S'(y; tp) is caUed the score function. 

It is necessary to consider the possibiUty of multiple local maxima since the 
likelihood will have multiple roots. Moreover, the likelihood function may be 
unbounded, which becomes a considerable concern when implementing various 
algorithms (as will be discussed). Focusing on local maxima on the interior of 
the parameter space (denoted by helps circumvent this problem because 
under certain regularity conditions, there exists a strongly consistent sequence 
of roots to the likelihood equation that is asymptotically efficient (see Ferguson 
(1996)). In fact, a -y/n-consistent estimator can be constructed using the method 
of moments estimator mentioned earlier. For a deeper treatment of the choice 
of root, as well as testing for a consistent root, refer to McLachlan and Peel 
(2000). 

The rate of convergence for the likelihood methods to be discussed will also 
be considered. Consider a norm || • || on * and a sequence ct such that ct as 
t 00. If the sequence of iterates {'«/''■*-'} draws sufficiently close to a solution 
ip* of (3.2), then the rate of convergence is given by 

--0*11 < ctllV'*^ --0*11^ (3.3) 

where t = 0,1, . . . and q> 1. When replacing the sequence {q} by a constant c, 
then we refer to q in (3.3) as a local rate of convergence. An illustration of local 
linear (g = 1) and local quadratic {q = 2) convergence is given in Figure 1. 
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3.1.1. Newton Methods 

An efficient way for solving (3.2) is to implement a Newton-type method. The 
Newton-Raphson method takes a linear Taylor series expansion about the cur- 
rent fit t/j^*^ for i/), which yields 



5(y;i/>)«5(y;i/>(*)) 

where 

/(V';y) 

which is the negative of the Hessian 
hand side of (3.4) yields the update 



-/(V(*);y)(tA-V'*)), (3.4) 



of £{^p). Then, finding a zero for the right 



,/,(*+!) =V>(*) + [/(V'(*);y)]-^5(y;tA(*)). (3.5) 

The Newton-Raphson method has the benefit of local quadratic convergence 
to a solution if}* of (3.2), but this convergence is not guaranteed. Aside from 
some other computational issues (as noted in McLachlan and Krishnan (1997)), 
Newton-Raphson has the benefit of providing, as an estimate of the variance- 
covariance matrix of the solution, the inverse of the observed information matrix, 
[/('i/'*; y)]~^- Thus, standard error estimates, confidence intervals, and inference 
procedures are readily available. 

One may also implement a quasi-Newton method by replacing I{t})^^^\y) in 
(3.5) by A, an approximation to the negative Hessian matrix. This yields the 
update 

-KA-i5(y;t/.(*)). 

While evaluation of the Hessian is avoided at each iteration, yielding a lower cost 
of computation, some drawbacks with this method are that the local quadratic 
convergence of the regular Newton-Raphson method is lost, convergence is not 
guaranteed, and erratic estimates for (.{xjj) may be obtained if a poor value of 
A is used. 

One final Newton-type method is Fisher's method of scoring. This method 
replaces /(■j/'^*^;y) in (3.5) by the expected (Fisher) information matrix, 

//(tA) = E^{5(Y;V)[5(Y;t/;)]T} 
= -E^{/(V;Y)}, 

evaluated at the current fit ■0'*^ for xjj. This yields the update 

0(*+i) =-0'*' + [//('0(*))]"'5'(y;-</'(*'). 

Another version of Fisher scoring uses the empirical information matrix, 

i=l i=l 
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yielding the update 

^(*+i)=V.(*) + [/,(V;(*))]-i5(y;t/;(*)). 

With both methods, one is relegated to local linear convergence and convergence 
is again not guaranteed. 

3.1.2. EM Algorithms 

As seen in the previous section, Newton methods can provide relatively 'speedy' 
convergence, but this convergence is not ensured and calculations like invert- 
ing the Hessian may be rather difficult to perform. An alternative is the use 
of Expectation-Maximization (EM) algorithms, which were popularized in the 
mixture modeling literature after the article by Dempster et al. (1977). We will 
focus on developing an EM algorithm for the mixture case, but it should be 
noted that this algorithm is one member in a much larger class of algorithms 
(see McLachlan and Krishnan (1997) and McLachlan and Peel (2000) for a dis- 
cussion) . 

Wc construct an EM algorithm for mixtures by first introducing the indicator 
random variable 

Z, j = ijobscrvation i belongs to component j}, 

for i = 1, . . . , n and j = I, . . . ,k such that !{•} is the indicator function. We 
refer to the measurements, Y, from earlier as the incomplete or observed data 
and (Y, Z) as the complete data. Here we use Y and Z to denote all of the Y^'s 
and Zij's, respectively. 

The observed data log likelihood is simply i(^p) from (3.1), but it will be 
denoted as £o{tp) when the meaning is not made clear by the context. The 
complete data log likelihood is given by 

n k 

e^i^f,) = logl['[[[x,9{y,■e,)f^- 
i=lJ=l 

n k 

= EEzMi°g[^^-9(y,;;^,)]- 

i=l 3 = 1 

ic{tp) is introduced to deal with the intractability of maximizing ioi'ip) with 
respect to tp. With the formal notation defined, we now construct an EM algo- 
rithm for mixture models. 

Algorithm 3.1 (EM Algorithm). 

1. Given a fixed i/'^*^ at the t*'' iteration, t = 0,1, . . ., calculate 

Q(t/;;t/>(*)) ^i?^„[4(V)|i^=y]. (3.6) 
This step is referred to as the Expectation Step or E-Step. 
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2. Find 

= axgmaxQ(i/);i/>(*)), 

which implies 

Q(t/.(*+i);t/;(*))> Q(^;t/,(*)) 

for allxj} £ . This step is referred to as the Maximization Step or M-Step. 

3. Iterate until a stopping criterion is attained. The final estimate obtained 
will be denoted by ■0. 

Notice ~ Bern(Xj), where Bern(Xj) is taken to mean the BernouUi dis- 
tribution with rate of success Xj, and Zij is independent of Yj» for all i* ^ i. 
Since E^(t) is a linear functional, the right hand side of (??) and (3.6) allows us 
to replace Zij by 



E^[Z,:,,|Y = y] = 



■^jg(yn^j) 
E/=i A/5(y,;6ii)' 



which follows from an application of Bayes' rule and the law of total probability. 
Thus, when provided the estimate ip^*\ we get 



(t) _ Xj 9{y^,(^J ) 



Also note in the E-Step, as stressed in Fhny and Zoppc (2000), is that the ex- 
pectation of the complete data log likelihood is conditioned on the observed data 
and it does not strictly replace missing data by their conditional expectations. 

As can be seen, the structure for an EM algorithm is rather simple and 
thus programming is relatively easy. While there have been some technical is- 
sues about the Dempster ct al. (1977) article addressed over the years (such 
as convergence results noted by Wu (1983)), we will discuss a couple of issues 
concerning implementation of Algorithm 3.1. 

One issue concerns selection of the initial values (i/j '"■'). Due to the multi- 
modality in the mixture likelihood, there are multiple local maxima and in some 
cases, a poor choice of can lead to the sequence of EM estimates diverging. 
Due to such features, it is strongly advocated to start EM algorithms from many 
different initial values. We will use a simple binning procedure to determine hy- 
perparameters for distributions used in random generation of the starting values. 
For reviews of possible options for starting values, see McLachlan and Krishnan 
(1997) or McLachlan and Peel (2000). 

Another issue concerns the stopping criterion. Usually an EM algorithm is 
run until 

£o{iP^'+^'>) - ioi¥'^) < e, (3.7) 
or, when given a norm j| • || on 4*, until 
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for some e > chosen arbitrarily small. Schafcr (1997) discusses the stopping 
criterion 

^« 

for Z = 1, 2, . . . , r, though this method fails when -0;''*'' « 0. Regardless, EM al- 
gorithms converge linearly, which can be very slow at times. An inappropriate 
stopping criterion may cause one to claim convergence too soon. Certain meth- 
ods, such as an Aitken-based acceleration technique, may be implemented to 
alleviate some of the difficulty with the slow rate of convergence (see Lindsay 
(1995) for a discussion). We use the method in (3.7) as our stopping criterion. 

Numerous EM-type algorithms can be found in the literature (see McLacliIan and Krishnan 
(1997) and McLaclilan and Peel (2000) for references). A useful extension of the 
EM algorithm is the Expectation / Conditional-Maximization (ECM) algorithm 
ofMeug and Rubin (1993). Consider a partition of say * = (*7' *Ji ■ • ■ i 
such that s <p. 

Algorithm 3.2 (ECM Algorithm). 

1. For a given t = 0, 1, . . calculate 

Q(V^;t/>(')) ^i?^,.) [4(^)1 r=y]. 

ij}^'^^ will be a specified initial value. This E-Step is the same step as in the 
EM algorithm of Algorithm 3.1. 

2. For each z = 1, 2, . . . , calculate 

V'i*^^'' = argmax Q(i/); ■0^*-'), 

where xpi-^ is fixed at xpf^ for all ii > i and xpi^ is fixed at ipf^'^^^ for all 
12 < i. These steps are referred to as the Conditional Maximization-Steps 
or CM-Steps. 

3. Iterate until a stopping criterion is attained. The final estimate obtained 
will be denoted by xp. 

There is also a multicycle ECM algorithm as given in Liu and Rubin (1994). 
This algorithm incorporates an additional E-Step between some or all of the 
CM-Steps. 



3.1.3. MM Algorithms and Adaptive Barrier Methods 

The EM algorithms of the previous section are special cases of MM algorithms, 
which are prescriptions for constructing such optimization algorithms. Since 
we present the EM algorithm in the context of maximization, MM stands for 
minorize / maximize. When in the context of minimization, MM stands for 
maximize / minorize. For our setting, MM algorithms operate by creating a 
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surrogate function (which wc denote by h) to drive an objective function uphiU. 
In the context of our mixture model, f/'^*'') minorizes (oitp) at -0'^*^ provided 

and 

L{ip) > for all V- 

After choosing a minorizing function, we then maximize it. In the EM setting, 
the minorizing function at ■0^*-' (shifted by a constant) is Q(i/'; V'^*'')- 

We do not go into much detail about MM algorithms here, but one may refer 
to Langc (2004) or Hmitcr and Lange (2004) for discussion. Our brief definition 
of MM algorithms provides a segue from EM algorithms to adaptive barrier 
methods, which are used in constrained optimization. 

Suppose we have an ECM setting where in one of the CM-Steps 

■^l*"*"^^ = argmaxQ(i/);t/;'^*^) 
= argmaxTO('0;), 

where ^pl g MJ' for some r* < r, the total number of parameters. In other words, 
we focus only on the portion of the parameter vector over which we are actually 
maximizing. Suppose further that m{xpi) is twice continuously differentiable 
subject to the linear inequality constraints Ci~ujxpi > for 1 < i < n. Provided 
'4''t~^^^ is in the interior of ^f;, maximization of m{'ipi) can be transferred to the 
surrogate function 

where /i > is a barrier constant. The barrier function (the portion of the 
surrogate function involving ^) forces to remain within the interior of 

the feasible region (i.e., the region satisfying the constraints). 

As Lange (1999) points out, it is impossible to maximize /i('0i; ■»/'[*•') ex- 
plicitly in most problems, so maximization using a quadratic approximation 
is suggested. However, it is often sufficient to perform a few steps of Newton- 
Raphson, thus avoiding the seemingly more complex quadratic approximation 
method. Further details on adaptive barrier methods in convex programming 
may also be found in Lange (1994). 

3.2. Bayesian Methods 

A Bayesian approach can be taken for estimation in mixture models provided a 
proper prior is used (i.e., a prior that sums or integrates to a finite value for the 
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discrete and continuous case, respectively). With the advancement in comput- 
ing power, developments in Markov Chain Monte Carlo (MCMC) algorithms 
have made Bayesian analyses an appealing method for analyzing mixture mod- 
els. McLachlan and Peel (2000) provide many references to Bayesian mixture 
analysis. The discussion we provide below will be from the perspective where 
the data are continuous, but the discrete case is analogous. 

Let Lo{ip) and Lc{ip) denote the observed data likelihood and complete data 
likelihood (the antilogarithms of io{tl') a^nd ^c('/')j respectively). Let z denote the 
realization of the component indicator random variable Z. Denote the proper 
prior density for as 7r(i/)) and the conditional density for Z given — ip as 
7r(z; The posterior density of ^p is then given by 

7r(^)L„(t^) 



^ Ez^(z;V')^(^)^e('»/') 

= K-'Y,7t{z;iP)7t{iP)L,{'iP), 

z 

where K denotes a normalizing constant. Now we are treating the mixture pa- 
rameters as random variable quantities. We partition ^ and denote the (inde- 
pendent) prior distributions on Afc_i and Q'^ by nA(A) and IIq{6), respectively. 
For the prior on the mixing proportions, we will always use nA(A) ~ Dirk{a), 
such that Dirk{a) is taken to mean the Dirichlet distribution with parameter 
vector a = {ai, a2, ■ ■ ■ , ak)"^ , where aj > for all j = 1, . . . , k. We use these pri- 
ors in outlining two MCMC algorithms (in a mixture context) that are common 
for posterior simulation: Gibbs samplers and Metropolis-Hastings algorithms. 

Using a Gibbs sampler (German and Gcman (1984)), we may simulate from 
each element of ip by conditioning on the current values of the other elements 
in ijj. With the formal notation defined, we can now construct a Gibbs sampler 
for mixture models. 

Algorithm 3.3 (Gibbs Sampler). 

1. Choose initial values t/j'"-' and Z^^K 

2. For a given t, such that i = 1, 2, . . ., simulate 



and 



X('^~D^rkL + j2Z^^~'^) 
^ i=i ' 



/or all i = l,...,fc. Here, Zi and Zj are vectors denoting the i^^ row 

and J*'' column of Z, respectively, and IIq{6; zfj^ ^^) is the conditional 
distribution of 6 given the previous iteration's value of Zj. 
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3. Simulate 

Z^P ^ Multk{l,'df), 

where Multk{l, ,[*'') is taken to mean the multinomial distribution consist- 
ing of one draw from k bins with probability of success vector 



4-. Increment t and repeat steps 2 and 3. 

As for a Metropolis-Hastings algorithm (Metropolis et al. (1953) and Hastings 
(1970)), a little more programming is usually necessary since there is the require- 
ment of a proposal density {q{tp*; xp)) to effectively search the entire parameter 
space. Many times, this density is chosen to be symmetric (i.e., (7(1/'*; t/j) = 
g('0;i/j*)), but as Gill (21)02) points out, this is not necessary. The decision 
about whether we accept a value, i/)*, from this proposal density will be based 
on the acceptance ratio, 

a(V.*;^) = ^(^^^*)^(^*) 



With the formal notation defined, we can now construct a Metropolis-Hastings 
algorithm for mixture models. 

Algorithm 3.4 (Metropolis-Hastings Algorithm). 

1. Choose initial values xp^^h Let t — 0. 

2. Sample xp* from q{xl>*;xp^*'^). 

3. Generate u ~ Unif(0, 1), such that Unif{0, 1) is taken to mean the uniform 
distribution over the interval (0,1). 

I Set 

,u(t+i)^ / V'*, a(-0*;'0'*') > u; 
^ \ t/jW, a(t/j*;-0(*)) < u. 

5. Increment t and repeat steps 2 through 4- 

Once we have an MCMC sample from the posterior, we may perform infer- 
ence for the parameters. This is in contrast to likelihood methods, that give 
only maximum likelihood estimates and an estimate of its sampling distribution 
variance-covariance matrix. 

We should note some issues when implementing these and other MCMC 
methods as found in Robert and Casclla (2004). For instance, choosing a pro- 
posal density q{ip*; ip) may require one to incorporate some sort of tuning pa- 
rameter (see Chib and Grecnberg (1995) and Cappc and Robert (2000)). There 
are also practical issues such as thinning the chain, burn-in, and selecting initial 
values. A major problem in using MCMC methods to estimate parameters in 
the mixture setting is label switching, which will be addressed later. 
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3.3. Number of Components 

Determining the number of components for (2.1) is still a major contemporary 
issue in mixture modeling. We will address here some of the techniques used in 
assessing the number of components when this is not known a priori. 

3.3.1. Likelihood Approaches 

From a likelihood perspective, we consider testing 

-ffo : k = ko 

Hi : fc = fco + 1 (3.8) 

for some positive integer fco. Letting and ■02 denote the MLEs of calculated 
under Hq and Hi, respectively, we could consider the likelihood ratio test (LRT) 
statistic 

-21ogA = 2{^(t/,i)-£(V^o)}. (3.9) 

It is well known that standard regularity conditions do not hold in the setting of 
(3.8) and thus the asymptotic distribution of (3.9) is not the usual chi-squared 
distribution (see Aitkin and Rul)in (1985) and Lindsay (1995) for a discussion). 
However, model selection techniques are still used in assessing the overall num- 
ber components as simulations have indicated relatively good empirical results 
(see McLachlan and Peel (2000) for references). We recommend not using these 
techniques solely in determining the number of components of a mixture model, 
but rather to give further supporting evidence to the number selected based on 
another method, such as a bootstrapping technique (to be discussed later). 

Four common model selection criteria are Akaike's information criterion (AIC) 
of Akaike (1973), the Bayesian information criterion (BIC) of Schwarz (1978), 
the Integrated Completed Likelihood (ICL) of Bicrnacki ct al. (2000), and the 
consistent AIC (CAIC) of Bozdogan (1987). Given an estimate tp, the form of 
these criteria are, respectively, 

AIC = £{ip) - r 

BIC = l{^)-Uog{n) 

k 

ICL = BIC-^Ajlog(Aj) 

CAIC = £(^)-^(log(n) + l), 

where r = kq + k — 1 is the number of parameters in the mixture setting. 
These values are calculated for a reasonable range of components and then the 
maximum of these values (for each criterion) corresponds to the number of 
components selected by that criterion. 



D. S. Young/An Overview of Mixture Models 



13 



As an alternative to penalized likelihood methods, Chen and Kalbfleisch (1996) 
present a penalized minimum-distance estimate. They argue that the penalized 
likelihood approach tends to produce a fit with fewer components. However, 
it is unknown whether or not this approach produces a consistent estimate of 
the number of mixture components. Hence, they use the penalized minimum- 
distance estimate, which they show to be consistent for the number of mixture 
components as well as the mixing distribution. Their method is used for any of 
the common distances, such as the Kolmogorov-Smirnov distance, the Cramer- 
von Mises distance, and the KuUback-Liebler information. While the last is not 
symmetric, it can be viewed as a distance and used when two distributions have 
common support. In fact, one may use a symmetrized version of the KuUback- 
Liebler distance to avoid the symmetry issue. The interested reader may refer to 
Chen and Kalbfleisch (1996) for applications of the penalized minimum-distance 
method. 

A commonly employed method in determining the number of components 
is a bootstrapping scheme proposed by McLachlan (1987). The algorithm is an 
attempt to approximate the null distribution of the LRT statistic values given in 
(3.9). We outline the algorithm for this parametric bootstrapping scheme using 
an EM algorithm as follows: 

Algorithm 3.5 (Parametric Bootstrapping the LRTs for Number of 
Components). 

1. Fit a mixture model with kg and kg+l components to the data, j/j, t/j, . . . , j/„, 
which leads to the EM estimates ipi and xp2, respectively. 

2. Calculate the (observed) log likelihood ratio statistic in (3.9). Denote this 
value by Eobs . 

3. Simulate a data set of size n from the null distribution ( the model with fco 
components) . Call this sample y^, yi,, . . . , y^. 

4-. Fit a mixture model with ko and fco -I- 1 components to the simulated data 
and calculate the corresponding 'bootstrap' log likelihood ratio statistic. 
Denote this value by S* . 

5. Repeat steps 3 and 4 B times to generate the bootstrap sampling distribu- 
tion of the likelihood ratio statistic, 5*, Sj, . . . , S^. 

6. Compute the bootstrap p-value as 

1 ^ 



Algorithm 3.5 is implemented by first testing 1 versus 2 components. A value 
oi Pb is obtained for this test and if it is lower than some significance level a, 
then claim statistical significance and proceed to test 2 versus 3 components. If 
not, stop and claim that there is not statistically significant evidence for a 2- 
component fit. Proceed in this manner until you fail to reject the null hypothesis. 

Exact theoretical results for testing (3.8) have been obtained in numerous 
special cases. As Lindsay (1995) points out, some of these testing scenarios yield 
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limiting distributions that citlicr resemble mixtures of chi-squared distributions 
of different degrees of freedom (called a chi-har- squared distribution) or can, in 
fact, be shown to be a chi-bar-squared distribution. One special case is when 
fco = 1 hi (3.8). Lindsay (1995) shows the limiting distribution for —2 log A in 
this case is 

\xl + \xl (3.10) 

where Xp denotes the chi-squared distribution with p degrees of freedom. 

Notice that (3.10) is just a linear combination of chi-squares. Because of this 
fact, there is no guarantee that the parametric bootstrap outlined in Algorithm 
3.5 will give a good approximation. An example where a statistic is asymp- 
totically distributed as a linear combination of chi-squares and the parametric 
bootstrap approximation fails can be found in Babu (1984). While this does 
present theoretical difficulties, it does not appear to be an issue often encoun- 
tered in practice. McLaclilan and Peel (2000) present simulation results and cite 
many references that endorse this method by assessing the accuracy oi ps as 
well as the overall power of the test. 

3.3.2. Bayesian Approaches 

In addition to the likelihood methods presented, there are a few Bayesian pro- 
cedures concerning estimating the number of components. One method is the 
Dirichlet process (Ferguson (1973)). For a parametric mixture model, rewrite 
(2.1) as 

f{yf,^P) = i2^My^■^(^l) = j 9{y^,e)dP{ei 

where the mixing distribution P is defined as X)j=i ^i^j)- Here, S{-) represents 
the Dirac measure on the parameter space O, meaning that P is a discrete 
distribution that puts mass Aj on 6j . Since the number of components is now 
random, the problem can be thought of as selecting a model out of the set of 
all possible mixture distributions on O. Thus, it is necessary to specify some 
sort of prior on this set. One way of accomplishing this is by implementing the 
Dirichlet process to obtain the prior on the set of all distributions on O. 

The focus is on P, which is a distribution on Q drawn from the Dirichlet 
process with parameter a, such that a is a finite measure on Q. While the 
details of the Dirichlet process (Ferguson (1990)) are beyond the scope of this 
work, an easier way to think about the Dirichlet process is what is referred to 
as Sethuraman's representation: 

Algorithm 3.6 (Sethuraman's Representation of the Dirichlet Pro- 
cess). 

1. Let a be the probability measure a/a{Q) and take 61,62, .. . as independent 
and identically distributed from a. 
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2. Take 71, 72, ... as independent and identically distributed from Beta{l, a(0)), 
chosen independently of 61,62, ■■ such that Beta{a,b) is taken to mean 
the beta distribution with shape parameters a and b. 

3. Define 

1*1 = 71 

72 = 72(1-71*) 



, k-l 

7fc =7fc(l-^7; 
^ j=i 



4. Let P ~ Sj^i Ij^i^j)' ^'^ ^^"^^ P puts mass 7* at dj. 

Scthuraman (1994) showed that P is a reahzation from the Dirichlet process 
with parameter a. In addition to Algorithm 3.6, Escobar (1994) presented a 
way to sample from the posterior when a Dirichlet prior is used with a location 
mixture of normals. 

Green (199-''>) proposed a framework for constructing a reversible jump MCMC 
in order to "jump" between parameter subspaces of varying dimensionality. This 
is appealing for Baycsian model determination because now prior information 
can be placed on the number of components in the model (as well as the com- 
ponent parameters) and provide effective exploration of the varying dimensions 
of parameter subspaces. Since k is now a parameter, the parameter vector of 
interest becomes 

Wfc = {6j, ...,6l,Xi,..., Xk-i,kf e flk, 

such that k Cz N. The elements of u^k are each regarded as being drawn from an 
appropriately defined prior distribution. A basic sketch of the reversible jump 
MCMC method is as follows: 

Algorithm 3.7 (Reversible Jump MCMC). 

1. Draw the value a;^ from the proposal density g{-\iVk) and target (posterior) 
distribution 7r(-). (Note that o;^ may be from a different subspace than uJk-) 

2. Let M ~ Ml U A/2 be a countable family of move types. 

(a) If move m g Mi is attempted with destination G Jlfc; then the 
acceptance of this sample is given by the appropriately defined accep- 
tance probability am\i^k,'^k)- 

(b) If move m S M2 is attempted with destination € J^fc', such that 
k ^ k' , then the acceptance of this sample is given by the appropri- 
ately defined acceptance probability a"m {'^JJk,|^k)■ 
3. Iterate. 
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Richardson and Green (1997) detail how to construct the acceptance proba- 
bilities in the above algorithm. Also, some of the possible moves m the authors 
suggest include: 

1. Updating any (or all) of 6i, . . . ,9k (an Mi— type move). 

2. Updating Ai, . . . , Xk-i (an il/i— type move). 

3. Splitting one of the mixture components into two components (an Af2— type 
move) . 

4. Merging two of the mixture components into one component (an A/2— type 
move). 

Stephens (2000a) provided a birth-and-death MCMC as an alternative to 
the reversible jump MCMC. The birth-and-death MCMC is a marked point 
process which can be viewed as a continuous-time version of the reversible jump 
MCMC with a limited number of moves. By limiting the number of moves, the 
implementation of the algorithm is simplified. 

A birth is said to occur if at time t G M+ the process is at i/j G ^k and then 
jumps to 

xP U (A, 6) = {Ai(l - A), ... , Xk-iil - A), A, 01, ... , Ok, 9} e *fc+i, (3.11) 
while a death (of the i"' component) occurs if the process jumps to 
OS f Ai Ai_i Ai_|_i Afe_i 

0i,...,6/,_i,0,+i,...,0fe| G *fc-i. (3.12) 

The overall rate that births occur is (3{ip) and the point (A, 9) is chosen according 
to some density h{if);X,9). Also, each point {Xi,9i) dies independently of the 
others according to a Poisson process with rate Sj{ip), where the overall death 
rate is 5('0) = ^ji'^)- ^ basic sketch of the birth-and-death MCMC is as 

follows: 

Algorithm 3.8 (Birth-and-Death MCMC). Set t ^ where t is the time 
of a birth or death. For 1/ ~ 0,1, . . . ,V , run the following until t > ly + I: 

1. Define the birth rate as a constant and calculate the death rate Sj{tp) 
of each component. 

2. Calculate the overall death rate S{xp) and simulate u ~ Unif(0, 1). 

3. Update t with 

log(^) 

4. Simulate the type of jump. 

(a) If the jump is a "birth", then simulate the new component {X,9) from 
the density h{rp;X,9) and update xj} as in (3.11). 

(h) If the jump is a "death", then select the new component {Xj,9j) to 
die with probability Sj{il>)/d{xp) and update ip as in (3.12). 
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5. Run I iterations of an MCMC sampler according to the current parameter 
space. 



3.4- Standard Errors 

After generating an MCMC sample using the procedures discussed in Section 
3.2, posterior standard deviations are easily obtained. With likelihood meth- 
ods, it is possible to obtain standard error estimates by using the inverse of 
the observed information matrix when implementing a Newton-type method. 
However, this may be computationally burdensome. An alternative way to re- 
port standard errors in the likelihood setting is by implementing a parametric 
bootstrap (Efron and Tibshirani (1993)). Efron and Tibshirani (1993) claim the 
parametric bootstrap should provide similar estimates to the standard errors 
compared to the method involving the information matrix. The development of 
this procedure has become useful for the mixture case as well. We outline the 
algorithm for a parametric bootstrapping scheme in the mixture setting using 
an EM algorithm as follows: 

Algorithm 3.9 (Parametric Bootstrap for Standard Errors). 

1. Find the maximum likelihood estimate xp by implementing an EM algo- 
rithm (Algorithm 3.1) based on the values i/i, 2/2: ■ ■ ■ i Vn- 

2. Generate a bootstrap sample of size n from f{y; xp) and call this sample 
yt,y^,...,y;,. 

3. Find the estimate tp for the bootstrap sample by implementing an EM 
algorithm. 

4-. Repeat steps 2 and 3 B times to generate the bootstrap sampling distribu- 
tion tp jtp , . . . jtp 

After implementing Algorithm 3.9, the bootstrap variancc-covariancc matrix 
is easily computed as the sample variance- covariance matrix of the generated val- 
ues -0^ ■0'" Thus, bootstrap standard errors are readily available. 
However, when performing a bootstrapping procedure in the mixture setting, 
one must be cognizant of the label switching problem described below. 



4. Identifiability 

In this section, we formally define identifiability for mixture distributions. This 
discussion and the definition of identifiability are adopted from McLaclilan and Peel 
(2000). 

Let J-'k denote a parametric family of fc-component mixture densities as de- 
scribed in (2.1) and T the class of all such Tk- So 

= {/fc(y,; V) : V' e and = U Tk. 

fc6N 
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Permuting the component labels of the mixture density results in J- being non- 
identifiable in ^P. We formalize this concept as a definition. 

Definition 4.1 (Identifiability). Consider 

k 

and 

k' 

/,.(y,;tA*)-^A*g(y,;0*), 

j=i 

which are both members of the class T . T is said to be identifiable for ip G 
* 'if fkiyi;ip) = fk*{yi;i>*) a.e. v[yi] if and only if (i) k = k* ; (ii) under 
permutation of the component labels, Xj ~ A* and g{yf,6j) ~ g(y^;dj) a.e. 
v[yj\ for all j = 1, . . . , fc and (Hi) Xj > and the 9j are distinct for all j . Here, 
ul'] is the underlying measure on W for g{y^,6). 

Definition 4.1 states that no element of T can arise in two different ways 
except by trivial means, such as letting some Xj = or splitting a component 
by letting Oj^ = Oj^ . 

4.1- Label Switching 

In Section 3, we saw possible estimation methods used in mixture modeling. 
These methods included a parametric bootstrap using EM algorithms to obtain 
standard error estimates and Bayesian inference via MCMC samplers. During 
the implementation of such iterative methods, one must be cognizant of the 
solutions being calculated from one iteration to the next since a given mixture 
component cannot be extracted from the likelihood. This situation occurs be- 
cause the component labels cannot be distinguished from one another due to 
the nonidentifiability in ip as established in Definition 4.1. Such a permutation 
of the component labels as in this definition is called label switching. 

There are numerous methods in the literature for dealing with label switch- 
ing (see Jasra et al. (2005) for a review of some of these techniques). One of 
the easiest methods for dealing with this issue, especially when the parame- 
ters are well-separated within the parameter space, is by imposing identifia- 
bility constraints on the parameters (such as Ai < A2 < . . . < A^). However, 
this method comes with caveats heavily emphasized in the literature (for in- 
stance, see McLaclilan and Peel (2000) and Stephens (2000b)). For example, 
consider fitting a mixture with k = 2 components with the mixing propor- 
tions close to 0.50. Imposing the identifiability constraint on the mixing pro- 
portions clearly infiuences the estimates of Oi and 82, thus creating a bias. 
Such a situation is highlighted in Celeux et al. (2000) where they present "dis- 
turbing" results when considering the various ordering constraints on a A: = 3 
component mixture of normals using an MCMC sampler. This identifiabil- 
ity can be imposed after the simulations have been completed, as Stephens 
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(1997) demonstrates for an MCMC sample of size by relabeling the sample 
(^^(1)^ ^(2)^ _ ^ ijr(^)) and applying permutations tti, 7r2, . . . , ttn such that the 
permuted sample (7ri(*(i)), 7r2(*'^^'), . . . , 7rAr(*(^))) satisfies the identifiability 
constraints. Since there is not always a clear choice of labeling, Richardson and Green 
(1997) stress post-processing the simulations under different permutations of the 
labels to determine an appropriate choice. 

One alternative method is to consider bootstrapping in mixtures. McLaclilan and Peel 
(2000) point out that label switching can usually be avoided by setting the EM 
algorithm's starting values to the maximum likelihood estimates, since EM al- 
gorithms are (generally) very dependent on the starting values. 

Next, note that since the likelihood of a /c-component mixture model is in- 
variant under permutation of the component labels, it effectively has k\ modes. 
Label switching is often presented in the context of Bayesian mixture modeling 
since the posterior distribution will also have this property under a symmetric 
prior. The first Bayesian method we will consider is a decision theoretic ap- 
proach as implemented in Celeux et al. (2000), Stephens (2000b), Hiun ct al. 
(2003), and Jasra et al. (2005). 

Consider estimating the parameters from the mixture model in (2.1). In a 
Bayesian framework, summarizing their posterior distributions will be viewed 
as choosing an action, a, from the action space, A. Then, define a loss function 
£ : ^ X I— > Since a will be a chosen vector of parameters, let a = ip. The 
objective is to find a value of ip that minimizes the posterior expected loss, or 
risk. This results in the Bayes estimator, which is defined as 



The expectation on the right hand side is the risk function, which is taken 
over the posterior distribution of ip\Y. When considering only the class of loss 
functions that are invariant under permutations of xp, the Bayes estimator in 
(4.1) becomes unaffected by label indexes. Once an appropriate loss function has 
been chosen, these procedures can be summarized into the following algorithm: 

Algorithm 4.1 (Loss-Based Algorithm for Label Sw^itching). 

1. Fix a value oftp. 

2. Generate a large number of realizations from the posterior distribution of 
■01 Y using an MCMC sampler. Call these realizations xp^^^ , xp^^^ , ■ . . , ^p^^'^ . 

3. Discard the first M iterations for burn-in. 

4. Calculate the MCMC ergodic averages (called Monte Carlo risk in Stephens 



ip 



argmin E^|y['C(V'; V") 



(4.1) 



(20U0b)) as 



W) = i?^|y[£(0;tA)] 




5. 



Find the Bayes estimator ij) = argmin^ 7^(1/?). 
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This entire process hinges on the selection of an appropriate loss function, 
which may be quite challenging. Yet if the loss function £ is chosen to be in- 
variant to permutations of the component labels, then label switching will not 
hamper the resulting Bayesian estimates. Stephens (2000b) recommends running 
Algorithm 4.1 from several starting points and choosing the Bayes estimate that 
provides the best local optimum found. 

Another procedure used within the Bayesian framework is by Chung et al. 
(2004), who suggest assigning as few as one observation to a component a priori. 
This amounts to using data-dependent priors where one or more observations 
are assigned to each component with certainty. The point is to apply enough 
information to break the symmetry of the likelihood and flatten the posterior 
density over A;! — 1 nuisance regions, which are the duplicate modes resulting 
from the permutations of the components. The posterior density in the sampler 
will now reflect a modified likelihood function which accommodates a density 
where one (or more) observations were assigned to each component. The major 
limitation of this approach is to what extent one is willing to accept preclassi- 
fying certain observations. 

5. Software 

In this section, we briefly outline a few software packages capable of performing 
analysis of mixture models. There are many packages which specialize in fitting 
certain mixture models, but the packages we mention here provide a little more 
versatility with respect to the selection of functions they offer. 

The SAS TRAJ procedure (Jones et al. (2001)) analyzes longitudinal data by 
fitting a mixture model. Specifically, PROC TRAJ fits semiparametric discrete 
mixture models to longitudinal data. Distributions available in this procedure 
include Bernoulli, censored normal, Poisson, and zero-inflated Poisson. 

The R programming language has a few packages available for analyzing mix- 
ture models. The mclust package (Fraley and Raftery (2006)) provides model- 
based clustering, density estimation, discriminant analysis, and the analysis 
of mixtures of (multivariate) normals under various parameterizations of the 
component-specific variance-covariance matrices. EM algorithms are used for 
estimation and BIC is used for determining the number of components. 

Another package available in R is the mixtools package (Young et al. (2008)). 
This package fits a wide array of mixture models including mixtures of (multi- 
variate) normals, mixtures of regressions, mixtures of Poisson regressions, mix- 
tures of logistic regressions, and mixtures of multinomials. EM algorithms are 
used for estimation in all of these cases and there is also a Metropolis-Hastings 
algorithm for the mixture of regressions setting. There are bootstrapping func- 
tions for testing the number of components as well as estimating the standard 
errors. There is also a stochastic semiparametric EM algorithm for estimating 
a nonparametric multivariate mixture model. 

One commercially available software program is Mplus (Muthcn and Muthcn 
(2008)). This program provides tools for mixture models, latent class analysis, 
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and survival mixtures, just to name a few. Mplus has the capability of carrying 
out these analyses for observed variables that are continuous, censored, binary, 
ordinal, nominal, or any combinations of these types. Mplus is a very flexible 
tool for researchers and provides many routines in addition to those for mixture 
analysis. 

6. Conclusion 

The area of finite mixture models has a rich literature demonstrating their appli- 
cability in a wide variety of fields. This article attempted to codify an overview 
of mixture modeling by providing an introduction to the topic, discussion of rel- 
evant issues in estimation, and outlining various algorithms. Researchers unfa- 
miliar with mixture modeling will hopefully gain an appreciation of their utility 
as well as an understanding of their limitations. We hope the reader gained a 
greater breadth of knowledge to aid them as they proceed with more specific 
literature on the various tools and issues regarding mixture modeling. 
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